Many eukaryotic genes undergo alternative splicing to produce multiple RNA isoforms, thus expanding the coding and regulatory potential of the genome. The transcriptome of specific biological samples are constructed from high throughput sequencing experiments by mapping short or long reads to a reference genome and assembling alignments into transcript architectures. The resulting Gene Transfer Format (GTF) files describe newly identified transcripts as sets of exonic coordinates, but they do not contain information about the coding sequences (CDSs; also known as ORFs) or possible biological functions of these transcripts. We previously published {factR}, an R package for the construction of CDSs for novel RNA isoforms. {factR} also predicts the domain organisation of the protein products of these isoforms, and identifies transcripts that are susceptible to nonsense-mediated decay (NMD; a pathway destabilizing mRNAs with premature translation termination codons).
This new version of {factR}, called {factR2}, extends the functionality of its predecessor by adding tools to probe for the regulatory potential of alternative exons and enhances its user-friendliness by adopting an R object class. {factR2} pinpoints alternative splicing events that trigger NMD, and it tests this regulation by correlating exon splicing inclusion with gene expression level. Additional features of {factR2} include direct quantification of exonic sequence conservation scores, an interactive visualisation of transcript and domain architectures and a two-way plot of expression or exon splicing levels.
{factR2} is currently available on GitHub and can be installed using {devtools}:
{factR2} only requires a custom transcriptome file in GTF
format as input. This file is typically generated by short-read
assemblers such as StringTie2 (Kovaka et al. 2019)
and Cufflinks,
(Trapnell et al. 2012) or by long-read
aligners such as minimap2
(Li 2018) and Isoquant (Prjibelski et al. 2023). We provide a sample
custom transcriptome file assembled from PacBio HiFi data using Isoquant
that can be used to test {factR2} functionalities (see “Creating a factRObject” for
more details).
Several {factR2} functions require reference genomic sequences and transcript annotations. To make it easy for users to obtain these files, we provide a convenient system to retrieve reference files of common model organisms from the GENCODE and Ensembl databases. Users may optionally use their own reference files as inputs.
The following packages are required to perform this workflow:
User-input custom transcriptome data are stored and analysed as part of an S4 class structure called factRObject. To create a factRObject, users can provide the local path to their custom GTF file, or a {GenomicRanges} object containing transcript architecture information of custom transcriptome. {factR2} comes with a sample GTF file assembled from long-read sequences of transcripts from postnatal mouse brain cortex (Joglekar et al. 2023). The local path to this GTF file can be retrieved as follows:
In addition to a custom GTF file, users are required to define the
genome used for read assembly. This is done by providing input to the
reference argument and accepted inputs include the name of
the organism (“Homo sapiens” or “Human”) or the specific ID
of {factR2} supported genomes. This ID list can be retrieved by
running the listSupportedGenomes() in console:
listSupportedGenomes()
#> ID species database release.date
#> 1 v44 Homo sapiens GENCODE 7.2023
#> 2 v43 Homo sapiens GENCODE 2.2023
#> 3 v42 Homo sapiens GENCODE 10.2022
#> 4 v41 Homo sapiens GENCODE 7.2022
#> 5 v40 Homo sapiens GENCODE 4.2022
#> 6 v39 Homo sapiens GENCODE 12.2021
#> 7 v38 Homo sapiens GENCODE 5.2021
#> 8 v37 Homo sapiens GENCODE 2.2021
#> 9 v36 Homo sapiens GENCODE 10.2020
#> 10 v35 Homo sapiens GENCODE 8.2020
#> 11 v34 Homo sapiens GENCODE 4.2020
#> 12 v33 Homo sapiens GENCODE 1.2020
#> 13 v32 Homo sapiens GENCODE 9.2019
#> 14 v31 Homo sapiens GENCODE 6.2019
#> 15 v30 Homo sapiens GENCODE 4.2019
#> 16 v29 Homo sapiens GENCODE 10.2018
#> 17 v28 Homo sapiens GENCODE 4.2018
#> 18 v27 Homo sapiens GENCODE 8.2017
#> 19 v26 Homo sapiens GENCODE 3.2017
#> 20 v25 Homo sapiens GENCODE 7.2016
#> 21 v24 Homo sapiens GENCODE 12.2015
#> 22 v23 Homo sapiens GENCODE 7.2015
#> 23 v22 Homo sapiens GENCODE 3.2015
#> 24 v21 Homo sapiens GENCODE 10.2014
#> 25 v20 Homo sapiens GENCODE 8.2014
#> [ reached 'max' / getOption("max.print") -- omitted 10 rows ]With these inputs, a factRObject can be created as follows:
fobj <- createfactRObject(gtf.file, reference = "vM33")
#> 🡆 Checking inputs
#> 🡆 Checking factRObject
#> 🡆 Adding custom transcriptome
#> ℹ Importing from local directory
#> 🡆 Adding annotation
#> ℹ Importing from URL
#> 🡆 Adding genome sequence
#> ℹ Using BSgenome object
#> 🡆 Matching chromosome names
#> 🡆 Matching gene information
#> Number of mismatched gene_ids found: 83520
#> --> Attempting to match ensembl gene_ids...
#> --> No ensembl gene ids found in query
#> ---> Attempting to match gene_ids by finding overlapping coordinates...
#> ---> 65203 gene_id matched
#> Total gene_ids corrected: 65203
#> Remaining number of mismatched gene_ids: 18317
#> 🡆 Creating factRset objects
#>
#> ℹ Adding gene information
#>
#> ℹ Adding transcript information
#>
#> ℹ Adding alternative splicing information
#>
#> 🡆 Annotating novel transcripts
#>
#> 🡆 Annotating novel AS events
#>
#> ℹ factRobject created!The function above extracts and stores gene-, transcript- and exon-level information from a custom GTF file and, by default, corrects for inconsistencies in gene annotation to the reference transcriptome.
{factR2} comes with wrapper functions to access the information stored within a factRObject. A preview of the object’s contents can be printed by typing the variable name:
fobj
#> class: factRObject [version 0.99.0]
#> # transcriptome:
#> 36931 genes;
#> 115055 transcripts [99855 novel];
#> 0 coding transcripts
#> # active set: AS
#> # samples (0):Transcripts that contain distinct sets of internal exons are defined as “novel”. Note that the sample GTF lack coding sequence information and thus, none of the transcripts are currently annotated as “coding”.
The metadata of genes, transcripts and exons are segregated into
individual “set” data in a factRObject. One of these “set” data
will be assigned as the active “set” and users may switch the active
“set” using the activeSet() function:
The exact names of the “set” can be printed out using
listSets():
The metadata associated with the current active “set” can be
displayed as a dataframe using the features() function:
features(fobj)
#> # A tibble: 115,055 × 8
#> transcript_id gene_id gene_name strand width novel cds nmd
#> <chr> <chr> <chr> <fct> <int> <chr> <chr> <chr>
#> 1 transcript51190.chr3.n… ENSMUS… Gnai3 - 1333 yes no no
#> 2 transcript62347.chr16.… ENSMUS… Cdc45 - 2279 yes no no
#> 3 transcript62383.chr16.… ENSMUS… Cdc45 - 679 yes no no
#> 4 transcript62392.chr16.… ENSMUS… Cdc45 - 2282 yes no no
#> 5 transcript45445.chrX.n… ENSMUS… Scml2 + 1237 yes no no
#> 6 transcript45454.chrX.n… ENSMUS… Scml2 + 677 yes no no
#> 7 transcript45461.chrX.n… ENSMUS… Scml2 + 451 yes no no
#> 8 transcript45463.chrX.n… ENSMUS… Scml2 + 669 yes no no
#> 9 transcript45465.chrX.n… ENSMUS… Scml2 + 464 yes no no
#> 10 transcript45467.chrX.n… ENSMUS… Scml2 + 425 yes no no
#> # ℹ 115,045 more rowsTo show more columns as an R dataframe, set show_more
argument to TRUE:
features(fobj, show_more = TRUE)
#> transcript_id
#> transcript51190.chr3.nnic transcript51190.chr3.nnic
#> transcript62347.chr16.nnic transcript62347.chr16.nnic
#> transcript62383.chr16.nnic transcript62383.chr16.nnic
#> transcript62392.chr16.nnic transcript62392.chr16.nnic
#> transcript45445.chrX.nnic transcript45445.chrX.nnic
#> transcript45454.chrX.nnic transcript45454.chrX.nnic
#> transcript45461.chrX.nnic transcript45461.chrX.nnic
#> transcript45463.chrX.nnic transcript45463.chrX.nnic
#> transcript45465.chrX.nnic transcript45465.chrX.nnic
#> transcript45467.chrX.nnic transcript45467.chrX.nnic
#> transcript93435.chr11.nnic transcript93435.chr11.nnic
#> transcript93455.chr11.nnic transcript93455.chr11.nnic
#> gene_id gene_name strand width
#> transcript51190.chr3.nnic ENSMUSG00000000001.5 Gnai3 - 1333
#> transcript62347.chr16.nnic ENSMUSG00000000028.16 Cdc45 - 2279
#> transcript62383.chr16.nnic ENSMUSG00000000028.16 Cdc45 - 679
#> transcript62392.chr16.nnic ENSMUSG00000000028.16 Cdc45 - 2282
#> transcript45445.chrX.nnic ENSMUSG00000000037.18 Scml2 + 1237
#> transcript45454.chrX.nnic ENSMUSG00000000037.18 Scml2 + 677
#> transcript45461.chrX.nnic ENSMUSG00000000037.18 Scml2 + 451
#> transcript45463.chrX.nnic ENSMUSG00000000037.18 Scml2 + 669
#> transcript45465.chrX.nnic ENSMUSG00000000037.18 Scml2 + 464
#> transcript45467.chrX.nnic ENSMUSG00000000037.18 Scml2 + 425
#> transcript93435.chr11.nnic ENSMUSG00000000049.12 Apoh + 1389
#> transcript93455.chr11.nnic ENSMUSG00000000049.12 Apoh + 1020
#> novel cds nmd
#> transcript51190.chr3.nnic yes no no
#> transcript62347.chr16.nnic yes no no
#> transcript62383.chr16.nnic yes no no
#> transcript62392.chr16.nnic yes no no
#> transcript45445.chrX.nnic yes no no
#> transcript45454.chrX.nnic yes no no
#> transcript45461.chrX.nnic yes no no
#> transcript45463.chrX.nnic yes no no
#> transcript45465.chrX.nnic yes no no
#> transcript45467.chrX.nnic yes no no
#> transcript93435.chr11.nnic yes no no
#> transcript93455.chr11.nnic yes no no
#> [ reached 'max' / getOption("max.print") -- omitted 115043 rows ]To display transcripts expressed from a particular gene, users can
provide IDs or names of genes as input to the second argument of
features(). For example, to query for the transcripts from
U2af1 gene, we can type:
features(fobj, "U2af1", show_more = TRUE)
#> transcript_id
#> transcript80929.chr17.nnic transcript80929.chr17.nnic
#> transcript80960.chr17.nnic transcript80960.chr17.nnic
#> transcript80965.chr17.nnic transcript80965.chr17.nnic
#> transcript81001.chr17.nnic transcript81001.chr17.nnic
#> transcript81044.chr17.nnic transcript81044.chr17.nnic
#> gene_id gene_name strand width
#> transcript80929.chr17.nnic ENSMUSG00000061613.14 U2af1 - 1038
#> transcript80960.chr17.nnic ENSMUSG00000061613.14 U2af1 - 971
#> transcript80965.chr17.nnic ENSMUSG00000061613.14 U2af1 - 971
#> transcript81001.chr17.nnic ENSMUSG00000061613.14 U2af1 - 842
#> transcript81044.chr17.nnic ENSMUSG00000061613.14 U2af1 - 785
#> novel cds nmd
#> transcript80929.chr17.nnic no no no
#> transcript80960.chr17.nnic no no no
#> transcript80965.chr17.nnic no no no
#> transcript81001.chr17.nnic yes no no
#> transcript81044.chr17.nnic no no noMultiple genes can be displayed by providing additional gene names:
features(fobj, "U2af1", "U2af2", show_more = TRUE)
#> transcript_id
#> transcript2976.chr7.nnic transcript2976.chr7.nnic
#> transcript2982.chr7.nnic transcript2982.chr7.nnic
#> transcript3019.chr7.nnic transcript3019.chr7.nnic
#> transcript3021.chr7.nnic transcript3021.chr7.nnic
#> transcript80929.chr17.nnic transcript80929.chr17.nnic
#> transcript80960.chr17.nnic transcript80960.chr17.nnic
#> transcript80965.chr17.nnic transcript80965.chr17.nnic
#> transcript81001.chr17.nnic transcript81001.chr17.nnic
#> transcript81044.chr17.nnic transcript81044.chr17.nnic
#> gene_id gene_name strand width
#> transcript2976.chr7.nnic ENSMUSG00000030435.17 U2af2 + 1191
#> transcript2982.chr7.nnic ENSMUSG00000030435.17 U2af2 + 1800
#> transcript3019.chr7.nnic ENSMUSG00000030435.17 U2af2 + 790
#> transcript3021.chr7.nnic ENSMUSG00000030435.17 U2af2 + 1354
#> transcript80929.chr17.nnic ENSMUSG00000061613.14 U2af1 - 1038
#> transcript80960.chr17.nnic ENSMUSG00000061613.14 U2af1 - 971
#> transcript80965.chr17.nnic ENSMUSG00000061613.14 U2af1 - 971
#> transcript81001.chr17.nnic ENSMUSG00000061613.14 U2af1 - 842
#> transcript81044.chr17.nnic ENSMUSG00000061613.14 U2af1 - 785
#> novel cds nmd
#> transcript2976.chr7.nnic no no no
#> transcript2982.chr7.nnic yes no no
#> transcript3019.chr7.nnic yes no no
#> transcript3021.chr7.nnic yes no no
#> transcript80929.chr17.nnic no no no
#> transcript80960.chr17.nnic no no no
#> transcript80965.chr17.nnic no no no
#> transcript81001.chr17.nnic yes no no
#> transcript81044.chr17.nnic no no noTo quickly display the metadata of a different “set”, users may
modify the set argument:
features(fobj, "U2af1", set = "AS", show_more = TRUE)
#> AS_id gene_id gene_name coord
#> AS10417 AS10417 ENSMUSG00000061613.14 U2af1 chr17:31870620-31870686
#> AS10419 AS10419 ENSMUSG00000061613.14 U2af1 chr17:31871476-31871542
#> AS10421 AS10421 ENSMUSG00000061613.14 U2af1 chr17:31873953-31874040
#> AStype strand width novel
#> AS10417 CE - 67 no
#> AS10419 CE - 67 no
#> AS10421 CE - 88 noor use convenient wrapper functions (see ?factR-meta for
more):
ase(fobj, "U2af1", show_more = TRUE)
#> AS_id gene_id gene_name coord
#> AS10417 AS10417 ENSMUSG00000061613.14 U2af1 chr17:31870620-31870686
#> AS10419 AS10419 ENSMUSG00000061613.14 U2af1 chr17:31871476-31871542
#> AS10421 AS10421 ENSMUSG00000061613.14 U2af1 chr17:31873953-31874040
#> AStype strand width novel
#> AS10417 CE - 67 no
#> AS10419 CE - 67 no
#> AS10421 CE - 88 noEach alternative splicing event has been assigned a internal unique ID for easy referencing and this can be found in the column “AS_id”.
The above features() function and other similar assessor
functions outputs a dataframe that can be saved as a new variable for
further wrangling. Alternatively, the output dataframe can be directly
piped to downstream functions. In the example below, we show how to
quickly count the number of events by the splicing type (AStype):
ase(fobj) %>%
group_by(AStype) %>%
tally()
#> ℹ Set `show_more to TRUE to show more info`
#> # A tibble: 6 × 2
#> AStype n
#> <fct> <int>
#> 1 CE 9298
#> 2 AD 8134
#> 3 AA 6391
#> 4 AF 11099
#> 5 AL 8705
#> 6 RI 4570{factR2} is pre-built with functions to visualise transcript architectures on an interactive plot:
Exon structures may sometimes appear thin and inconspicuous
especially when flanking introns are extremely long. To focus on the
exon architectures, users may set the rescale_introns
argument to TRUE:
{factR2} contains 5 key functions to probe for biological function of custom transcripts:
buildCDS(): to construct CDS informationgetAAsequence(): to translate amino acid sequencespredictNMD(): to predict for NMD-sensitive
transcriptstestASNMDevents(): to pinpoint splicing events leading
to NMDgetAScons(): to quantify exon-level sequence
conservation scoresThe above functions can be simultaneously performed by running the
runfactR() pipeline and this will update the metadata
stored within the factRObject:
fobj <- runfactR(fobj)
#> 🡆 Building CDS information
#> Searching for reference mRNAs in query
#> 16 reference mRNAs found and its CDS were assigned
#> Building database of annotated ATG codons
#> Selecting best ATG start codon for remaining transcripts and determining open-reading frame
#> 44009 new CDSs constructed
#>
#> Summary: Out of 115056 transcripts in `gtf`,
#> 44025 transcript CDSs were built
#> ℹ Updating transcript feature data
#>
#> 🡆 Running transcript-level NMD prediction
#>
#> Predicting NMD sensitivities for 44025 mRNAs
#> ℹ Updating transcript feature data
#>
#> 🡆 Translating amino acid sequences
#>
#> 🡆 Running AS-NMD testing
#>
#> ℹ Getting best reference for AS-NMD testing
#>
#> ℹ Testing AS-NMD exons
#>
#> ℹ Updating AS feature data
#>
#> 🡆 Quantifiying conservation scores
#>
#> ℹ Retrieving phastCons35way.UCSC.mm39 scores
#>
#> ℹ Quantifying `exon` conservation scores with 0 paddingThe number of newly-identified coding transcripts is now reflected in the object summary:
fobj
#> class: factRObject [version 0.99.0]
#> # transcriptome:
#> 36931 genes;
#> 115055 transcripts [99855 novel];
#> 44025 coding transcripts
#> # active set: transcript
#> # samples (0):Coding transcripts that are predicted to be NMD-sensitive are now annotated as “yes” under the column “nmd” in the transcript metadata:
features(fobj, "U2af1", show_more = TRUE)
#> transcript_id
#> transcript80929.chr17.nnic transcript80929.chr17.nnic
#> transcript80960.chr17.nnic transcript80960.chr17.nnic
#> transcript80965.chr17.nnic transcript80965.chr17.nnic
#> transcript81001.chr17.nnic transcript81001.chr17.nnic
#> transcript81044.chr17.nnic transcript81044.chr17.nnic
#> gene_id gene_name strand width
#> transcript80929.chr17.nnic ENSMUSG00000061613.14 U2af1 - 1038
#> transcript80960.chr17.nnic ENSMUSG00000061613.14 U2af1 - 971
#> transcript80965.chr17.nnic ENSMUSG00000061613.14 U2af1 - 971
#> transcript81001.chr17.nnic ENSMUSG00000061613.14 U2af1 - 842
#> transcript81044.chr17.nnic ENSMUSG00000061613.14 U2af1 - 785
#> novel cds nmd stop_to_lastEJ num_of_downEJs
#> transcript80929.chr17.nnic no yes yes 417 5
#> transcript80960.chr17.nnic no yes no -142 0
#> transcript80965.chr17.nnic no yes no -142 0
#> transcript81001.chr17.nnic yes yes yes 430 5
#> transcript81044.chr17.nnic no yes yes 342 4
#> 3'UTR_length is_NMD
#> transcript80929.chr17.nnic 689 TRUE
#> transcript80960.chr17.nnic 130 FALSE
#> transcript80965.chr17.nnic 130 FALSE
#> transcript81001.chr17.nnic 702 TRUE
#> transcript81044.chr17.nnic 614 TRUEThe total number of NMD-sensitive transcripts can be queried as such:
features(fobj) %>%
group_by(nmd) %>%
tally()
#> ℹ Set `show_more to TRUE to show more info`
#> # A tibble: 2 × 2
#> nmd n
#> <chr> <int>
#> 1 no 111665
#> 2 yes 3390Now, plotTranscripts() will distinguish between CDS and
untranslated region (UTR) segments:
By default, the runfactR pipeline will compute the
sequence conservation scores of the entire exon. To determine the
conservation of intronic sequences flanking these alternative exons (50
base pairs on each side), users may rerun the getAScons()
function and the newly-computed scores will be reflected as a separate
column in the “AS” set metadata:
fobj <- getAScons(fobj, type = "flanks", padding = 50)
#> 🡆 Quantifiying conservation scores
#> ℹ Retrieving phastCons35way.UCSC.mm39 scores
#> ℹ Quantifying `flanks` conservation scores with 50 padding
ase(fobj, show_more = TRUE)
#> AS_id gene_id gene_name coord
#> AS01129 AS01129 ENSMUSG00000033813.16 Tcea1 chr1:4957055-4957285
#> AS01642 AS01642 ENSMUSG00000025907.15 Rb1cc1 chr1:6285182-6286285
#> AS01647 AS01647 ENSMUSG00000025907.15 Rb1cc1 chr1:6292904-6293402
#> AS01649 AS01649 ENSMUSG00000025907.15 Rb1cc1 chr1:6304312-6304452
#> AS01650 AS01650 ENSMUSG00000025907.15 Rb1cc1 chr1:6315204-6315381
#> AS01651 AS01651 ENSMUSG00000025907.15 Rb1cc1 chr1:6315569-6315648
#> AS01652 AS01652 ENSMUSG00000025907.15 Rb1cc1 chr1:6319884-6319970
#> AS01657 AS01657 ENSMUSG00000025907.15 Rb1cc1 chr1:6331320-6332068
#> AStype strand width novel ASNMDtype ASNMD.in.cds Cons.exon.pad0
#> AS01129 AD + 231 yes <NA> NA 0.043722944
#> AS01642 AD + 1104 yes <NA> NA 0.138768116
#> AS01647 AF + 499 yes <NA> NA 0.000000000
#> AS01649 RI + 141 yes <NA> NA 0.121985816
#> AS01650 RI + 178 yes <NA> NA 0.110674157
#> AS01651 RI + 80 yes <NA> NA 0.132500000
#> AS01652 RI + 87 yes <NA> NA 0.965517241
#> AS01657 AD + 749 yes <NA> NA 0.005740988
#> Cons.flanks.pad50
#> AS01129 0.462
#> AS01642 0.389
#> AS01647 0.026
#> AS01649 0.997
#> AS01650 0.999
#> AS01651 0.979
#> AS01652 0.921
#> AS01657 0.515
#> [ reached 'max' / getOption("max.print") -- omitted 48189 rows ]AS-NMD events are classified as “Stimulating” if the longer form (exon inclusion) of the transcript is NMD-sensitive or “Repressing” if the shorter form (exon skipping) of the transcript is NMD-sensitive. Both AS-NMD event types were detected in the U2af1 example:
ase(fobj, "U2af1", show_more = TRUE)
#> AS_id gene_id gene_name coord
#> AS10417 AS10417 ENSMUSG00000061613.14 U2af1 chr17:31870620-31870686
#> AS10419 AS10419 ENSMUSG00000061613.14 U2af1 chr17:31871476-31871542
#> AS10421 AS10421 ENSMUSG00000061613.14 U2af1 chr17:31873953-31874040
#> AStype strand width novel ASNMDtype ASNMD.in.cds Cons.exon.pad0
#> AS10417 CE - 67 no <NA> NA 1
#> AS10419 CE - 67 no Stimulating TRUE 1
#> AS10421 CE - 88 no Repressing TRUE 1
#> Cons.flanks.pad50
#> AS10417 1.000
#> AS10419 1.000
#> AS10421 0.653This can be visually verified by running
plotTranscripts() with coordinates as input:
To further test the regulatory function of these AS-NMD events, we
can correlate the exon inclusion levels (PSI or Percent Spliced In) of
AS-NMD events to its gene expression level. {factR2} can
compute gene expression counts and PSI values from transcript-level
expression count matrices. To demonstrate this, we provide a sample
transcript count matrix of all detected isoforms in postnatal mouse
brain at two developmental timepoints. This count matrix can be added to
the factRObject using the addTxCounts()
function:
counts <- system.file("extdata/pb_expression.tsv.gz", package = "factR2")
fobj <- addTxCounts(fobj, counts)
#> 🡆 Adding expression data
#> ℹ Importing from local file
#> 🡆 Creating samples metadata
#> 🡆 Processing expression data
#> ℹ Adding gene counts
#> ℹ Adding spliced-event counts
#> ℹ Normalizing countsUsers may choose to use pre-computed PSI values from other splicing
quantification tool by providing the path to the PSI matrix to the
psi argument.
To test the correlation between exon PSI values and gene expression
levels, users may use the testGeneCorr() function. By
default, this function will perform a two-sided Pearson’s product moment
correlation on variance-stabilised PSI and normalised gene expression
expression. Users may customise the parameters of the correlation test
by parsing addition arguments to the R cor.test() function.
In the example below, we show how one would modify the
testGeneCorr() function to carry out one-sided test:
The above function creates 2 new columns in AS metadata:
gene.cor.estimate which returns the correlation coefficient
of the test and gene.cor.pval which returns the unadjusted
significant level. For AS-NMD events of type “Stimulating”, the PSI
values have been transformed to 1-PSI so that the
correlation coefficient of values for gene.cor.estimate are
positive. In our example below, the AS-NMD stimulating event
AS10419 showed significant correlation with U2af gene
expression.
ase(fobj, "U2af1", show_more = TRUE)
#> AS_id gene_id gene_name coord
#> AS10417 AS10417 ENSMUSG00000061613.14 U2af1 chr17:31870620-31870686
#> AS10419 AS10419 ENSMUSG00000061613.14 U2af1 chr17:31871476-31871542
#> AS10421 AS10421 ENSMUSG00000061613.14 U2af1 chr17:31873953-31874040
#> AStype strand width novel ASNMDtype ASNMD.in.cds Cons.exon.pad0
#> AS10417 CE - 67 no <NA> NA 1
#> AS10419 CE - 67 no Stimulating TRUE 1
#> AS10421 CE - 88 no Repressing TRUE 1
#> Cons.flanks.pad50 gene.cor.estimate gene.cor.pval
#> AS10417 1.000 0.813098344 0.0006501011
#> AS10419 1.000 0.834649627 0.0003661430
#> AS10421 0.653 -0.001520686 0.5018711510We can plot a two-way plot of U2af1 expression against AS10419 PSI
values using the plot2way() function:
factR2 can also inspect domain structure of protein products
encoded by newly identified mRNA isoforms using its
predictDomains() function. This function requires
connection to the online PFAM database and may require a substantial
amount of time and stable internet connection to query multiple
transcripts simultaneously. To quickly explore and visualise the output
of the predictDomains() tool, user may prefer to use the
plotDomains() function. For example, the following will
predict domain structure for all transcripts of the U2af1 gene:
If you want to predict domains for all new transcripts and do not mind waiting for some time depending on the connection speed and the PFAM server load, run the following:
The factRObject can be saved as an RDS file and restored later for further analysis:
Key data stored in a factRObject (updated GTF file and
metadata files) can be exported as text files in the local workding
directory using the exportAll function:
Alternatively, users may use the exportGTF() and
exportTable() functions to export individual data:
This workflow was conducted on:
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BSgenome.Mmusculus.UCSC.mm39_1.4.3 BSgenome_1.66.3
#> [3] Biostrings_2.66.0 XVector_0.38.0
#> [5] rtracklayer_1.58.0 GenomicRanges_1.50.2
#> [7] GenomeInfoDb_1.34.9 IRanges_2.32.0
#> [9] S4Vectors_0.36.2 BiocGenerics_0.44.0
#> [11] lubridate_1.9.2 forcats_1.0.0
#> [13] stringr_1.5.0 dplyr_1.1.3
#> [15] purrr_1.0.2 readr_2.1.4
#> [17] tidyr_1.3.0 tibble_3.2.1
#> [19] ggplot2_3.4.3 tidyverse_2.0.0
#> [21] factR2_0.99.1
#>
#> loaded via a namespace (and not attached):
#> [1] AnnotationHub_3.6.0 BiocFileCache_2.6.1
#> [3] lazyeval_0.2.2 splines_4.2.1
#> [5] BiocParallel_1.32.6 crosstalk_1.2.0
#> [7] usethis_2.2.2 digest_0.6.33
#> [9] htmltools_0.5.6 rsconnect_1.0.2
#> [11] fansi_1.0.4 magrittr_2.0.3
#> [13] memoise_2.0.1 tzdb_0.4.0
#> [15] remotes_2.4.2.1 annotate_1.76.0
#> [17] matrixStats_1.0.0 timechange_0.2.0
#> [19] pkgdown_2.0.7 prettyunits_1.1.1
#> [21] colorspace_2.1-0 rappdirs_0.3.3
#> [23] blob_1.2.4 xfun_0.40
#> [25] callr_3.7.3 crayon_1.5.2
#> [27] RCurl_1.98-1.12 jsonlite_1.8.7
#> [29] glue_1.6.2 gtable_0.3.4
#> [31] zlibbioc_1.44.0 DelayedArray_0.24.0
#> [33] pkgbuild_1.4.2 Rhdf5lib_1.20.0
#> [35] HDF5Array_1.26.0 scales_1.2.1
#> [37] DBI_1.1.3 miniUI_0.1.1.1
#> [39] Rcpp_1.0.11 viridisLite_0.4.2
#> [41] xtable_1.8-4 progress_1.2.2
#> [43] bit_4.0.5 profvis_0.3.8
#> [45] htmlwidgets_1.6.2 httr_1.4.7
#> [47] RColorBrewer_1.1-3 ellipsis_0.3.2
#> [49] farver_2.1.1 urlchecker_1.0.1
#> [51] pkgconfig_2.0.3 XML_3.99-0.14
#> [53] sass_0.4.7 dbplyr_2.3.3
#> [55] locfit_1.5-9.8 utf8_1.2.3
#> [57] tidyselect_1.2.0 labeling_0.4.3
#> [59] rlang_1.1.1 later_1.3.1
#> [61] AnnotationDbi_1.60.2 BiocVersion_3.16.0
#> [63] munsell_0.5.0 tools_4.2.1
#> [65] cachem_1.0.8 cli_3.6.1
#> [67] generics_0.1.3 RSQLite_2.3.1
#> [69] devtools_2.4.5 evaluate_0.21
#> [71] fastmap_1.1.1 yaml_2.3.7
#> [73] processx_3.8.2 knitr_1.44
#> [75] bit64_4.0.5 fs_1.6.3
#> [77] KEGGREST_1.38.0 nlme_3.1-163
#> [79] pbapply_1.7-2 whisker_0.4.1
#> [81] mime_0.12 xml2_1.3.5
#> [83] biomaRt_2.54.1 compiler_4.2.1
#> [85] rstudioapi_0.15.0 interactiveDisplayBase_1.36.0
#> [87] filelock_1.0.2 plotly_4.10.2
#> [89] curl_5.0.2 png_0.1-8
#> [91] geneplotter_1.76.0 bslib_0.5.1
#> [93] stringi_1.7.12 ps_1.7.5
#> [95] GenomicFeatures_1.50.4 desc_1.4.2
#> [97] lattice_0.21-8 Matrix_1.6-1
#> [99] vctrs_0.6.3 rhdf5filters_1.10.1
#> [ reached getOption("max.print") -- omitted 35 entries ]